In this report, we study microarray gene expression data from patients with stage III melanoma. More specifically, we use the dataset to predict whether a patient is still alive. First, we obtain the dataset GSE54467 from the Gene Expression Omnibus (GEO) using the GEOquery package, which returns it as an ExpressionSet object (a class defined in the Biobase package). The phenotype data contains relevant information for each patient and is shown in the table below.
set.seed(2022) # set the seed to 2022 so the results are reproducible each time the code is run
library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)
# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
gset <- getGEO("GSE54467", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")[[1]]
## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))
head(pheno)
However, the most important data for our analysis is the gene expression matrix, which we can extract from the gset object using the exprs() function. Each column of this matrix is a sample and each row is a gene. The boxplot of the first five samples shows closely matching distributions, which suggests that the data has already been normalised.
## Get gene expression data
expVal <- exprs(gset)
boxplot(expVal[, 1:5])
## Get gene annotations
geneAnno <- fData(gset)
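The boxplot above covers only five columns of the expression matrix. As a minimal sketch, the same check can be extended to every sample by comparing per-sample quartiles. The helper `sample_quartiles()` and the toy matrix are hypothetical and not part of the original analysis; in the report, `expVal` would take the place of `toy`.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# lower quartile, median and upper quartile for each sample (column).
sample_quartiles <- function(mat) {
  # apply() over columns gives a 3 x n_samples matrix; transpose so rows are samples
  t(apply(mat, 2, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))
}

# Toy stand-in for expVal: 1000 features x 3 samples with nearly identical
# distributions (built deterministically so the random seed is left untouched)
base <- qnorm(ppoints(1000), mean = 8, sd = 2)
toy <- cbind(sample1 = base, sample2 = base + 0.05, sample3 = base - 0.05)

q <- sample_quartiles(toy)
q

# Spread of the per-sample medians: a value that is small relative to the
# interquartile range is consistent with already-normalised data
diff(range(q[, "50%"]))
```

If the medians and quartiles line up across all samples, that supports the reading of the boxplot; large differences would suggest that a normalisation step is still needed.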
The data frame of expression values for the 26,085 genes and 79 patients is shown below.
as.data.frame(expVal)
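Outside an interactive notebook, printing the full data frame writes every row to the console. A minimal sketch of a more compact alternative is to confirm the dimensions and show only a corner of the matrix. The helper `preview_matrix()` and the toy matrix are hypothetical; in the report, `expVal` would take the place of `toy`.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# return the top-left corner of a matrix as a data frame.
preview_matrix <- function(mat, n_rows = 5, n_cols = 3) {
  n_rows <- min(n_rows, nrow(mat))
  n_cols <- min(n_cols, ncol(mat))
  # drop = FALSE keeps the matrix shape even when a single row or column is selected
  as.data.frame(mat[seq_len(n_rows), seq_len(n_cols), drop = FALSE])
}

# Toy stand-in for expVal: 12 features x 5 samples
toy <- matrix(seq_len(60), nrow = 12,
              dimnames = list(paste0("probe", 1:12), paste0("GSM", 1:5)))

dim(toy)            # number of features (rows) and samples (columns)
preview_matrix(toy) # first 5 features for the first 3 samples
```

Applied to `expVal`, `dim()` gives a quick way to confirm the gene and patient counts quoted above.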